Reconstructing Semantic Structures in Technical Documentation with Vector Space Classification
Abstract
With the increasing popularity of component content management systems, a large part of technical documentation in manufacturing and mechanical engineering is written in semantically structured, XML-based information models. Content delivery portals can use this information to provide users with advanced retrieval or filtering functions. However, legacy content is often excluded from such granular access because archival file formats, such as untagged PDF documents, lack semantic structure. In this paper we introduce an approach that uses the classification knowledge present in available content components to reconstruct document structures in text extracted from legacy files. The method leverages transitions in classification confidence for distributed text chunks to detect boundaries between content components of different semantic classes. Classification is done using a modified vector space model for technical documentation. To measure confidence, we derive a measure based on properties of cosine similarity in multiclass scenarios. We present initial results that show a strong correlation between predicted semantic structures and original document outlines, and we give proposals for further improvement.
CCS Concepts • Information systems → Clustering and classification; Content analysis and feature selection
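The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the raw term-count weighting stands in for the paper's modified vector space model, the confidence score (the gap between the best and second-best cosine similarity) is an assumed placeholder for the measure the paper derives, and all function names, the training snippets, and the window parameters are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; keeps purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_prototypes(labeled_components):
    # One prototype vector per semantic class: summed term counts
    # over all known content components of that class.
    protos = {}
    for label, text in labeled_components:
        protos.setdefault(label, Counter()).update(tokenize(text))
    return protos


def classify(chunk_tokens, protos):
    # Returns (best class, confidence). Confidence here is ASSUMED to be
    # the margin between the two highest similarities.
    vec = Counter(chunk_tokens)
    sims = sorted(((cosine(vec, p), label) for label, p in protos.items()),
                  reverse=True)
    best_sim, best_label = sims[0]
    runner_up = sims[1][0] if len(sims) > 1 else 0.0
    return best_label, best_sim - runner_up


def predict_boundaries(tokens, protos, window=4, step=2):
    # Slide a window over the legacy text, classify each chunk, and
    # report the chunk start offsets where the predicted class changes.
    labels = []
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, step):
        label, conf = classify(tokens[start:start + window], protos)
        labels.append((start, label, conf))
    boundaries = [labels[i][0] for i in range(1, len(labels))
                  if labels[i][1] != labels[i - 1][1]]
    return labels, boundaries


# Hypothetical training components and legacy text.
training = [
    ("safety", "warning danger risk of injury wear protective gloves"),
    ("procedure", "remove the screw then open the cover and replace the filter"),
]
legacy_text = ("warning danger risk of injury wear protective gloves "
               "remove the screw then open the cover and replace the filter")

protos = class_prototypes(training)
labels, boundaries = predict_boundaries(tokenize(legacy_text), protos)
for start, label, conf in labels:
    print(start, label, round(conf, 3))
print("predicted boundaries at token offsets:", boundaries)
```

In this toy run the margin drops for the chunk that straddles the two components, which is the kind of confidence transition the abstract refers to. A real system would need a more robust weighting scheme and smoothing over neighbouring chunks.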
Similar resources
Semantic Indexing of Technical Documentation
This research takes place in an industrial context: the CONTINEW Company. This company ensures the storage and security of critical data and technical documentation. Consequently, it is necessary to organize these documents in order to retrieve quickly critical information. The management of this increasing volume of documents requires document classification which is based on indexing techniqu...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
Space Vector Modulation Based on Classification Method in Three-Phase Multi-Level Voltage Source Inverters
Pulse Width Modulation (PWM) techniques are commonly used to control the output voltage and current of DC to AC converters. Space Vector Modulation (SVM), of all PWM methods, has attracted attention because of its simplicity and desired properties in digital control of Three-Phase inverters. The main drawback of this PWM technique is its complex and time-consuming computations in real-time ...
FUZZY HV-SUBSTRUCTURES IN A TWO DIMENSIONAL EUCLIDEAN VECTOR SPACE
In this paper, we study fuzzy substructures in connection with Hv-structures. The original idea comes from geometry, especially from the two dimensional Euclidean vector space. Using parameters, we obtain a large number of hyperstructures of the group-like or ring-like types. We connect, also, the mentioned hyperstructures with the theta-operations to obtain more strict hyperstructures, as Hv-groups...
Publication date: 2016